Linear-scaling DFT+U with full local orbital optimization 
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We present an approach to the DFT+U method (Density Functional Theory + Hubbard model) 
within which the computational effort for calculation of ground state energies and forces scales 
linearly with system size. We employ a formulation of the Hubbard model using nonorthogonal 
projector functions to define the localized subspaces, and apply it to a local-orbital DFT method 
including in situ orbital optimization. The resulting approach thus combines linear-scaling and 
systematic variational convergence. We demonstrate the scaling of the method by applying it to 
nickel oxide nano-clusters with sizes exceeding 7, 000 atoms. 
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I. INTRODUCTION 

The success of the Kohn-Sham formulation of density 
functional theory (DFT)^'^ is largely owed to its capa- 
bility of accurately and reliably reproducing the ground- 
state properties of quantum-mechanical systems. Two 
factors that limit the applicability of DFT are the com- 
putational expense of treating large systems, and the 
difficulties encountered in simulating so-called "strongly- 
correlated" systems. The realistic study of many techno- 
logically and biologically important structures requires 
the explicit treatment of very large system sizes, yet the 
asymptotic scaling of conventional DFT algorithms is cu- 
bic in the number of atoms, so that the feasible size 
limit for routine calculations is typically around 1,000 
atoms even on powerful high-performance computing ar- 
chitectures. Only by using DFT codes for which the ef- 
fort increases linearly with system size, recently reviewed 
in Ref. 3 and examples of which include ONETEP^"^, 
OPENMX^'^, and CONQUEST^-i", we may routinely 
bring first-principles simulation to bear on pertinent 
technological, environmental and medical problems. Fur- 
thermore, for many functional materials, most typically 
comprising open-shell first-row transition metal or lan- 
thanoid ions, DFT with local or semi-local functionals 
performs very poorly, failing to obtain qualitative agree- 
ment with experimental observations in the most severe 
cases. Many methods have been developed to overcome 
this deficiency, and here we focus on the DFT+[/^^ tech- 
nique due to its widespread adoption and its amenability 
to linear-scaling implementation. 

In this article, we present a computational method- 
ology to tackle the obstacles of large system size and 
strong correlation effects simultaneously. Working in the 
framework of a linear-scaling implementation of DFT in 
order to tackle the issue of system size, we fully detail a 
DFT+J7 implementation, including self-consistent total- 
energies and forces, and demonstrate computational scal- 
ing tests on a strongly interacting oxide system of over 



7, 000 atoms. Previous linear-scaling or otherwise large- 
scale implementations of DFT+C/, examples including 
Refs. 12-14, have relied on a basis of fixed user-defined 
or numerically pre-solved atomic orbitals. A notewor- 
thy advantage of our approach is that we allow for the 
optimization of the orbitals representing the Kohn-Sham 
density-matrix in situ, that is during the process of total- 
energy minimization, with respect to an underlying, sys- 
tematic plane- wave basis*'^^. In this manner, we move 
beyond the fixed-orbital approximation to linear-scaling 
DFT and DFT+f/. Using this approach, truly first- 
principles simulations may be carried out on systems 
comprising both strong electronic interactions and large 
spatial disorder, examples including layered transition- 
metal and lanthanoid oxide structures, catalytic surfaces, 
molecular magnets and organometallic biomolecules. 

The article is organized as follows. We describe 
the DFT+t/ technique and its generalization to the 
nonorthogonal case in Section II, after which we intro- 
duce linear-scaling DFT and define the notation and 
sparse matrix algebra for linear-scaling DFT+U in Sec- 
tion III. Minimization of the DFT+U total-energy with 
respect to the density-matrix is detailed in Section IV. 
The method is applied to nickel oxide clusters exceeding 
7, 000 atoms in Section V, and linear-scaling performance 
is demonstrated. Following some concluding remarks, we 
detail the method used to preserve the density-matrix 
idempotency and normalization in Appendices A and B, 
and to compute the DFT+U ionic forces in Appendix C. 



II. DFT+;7 METHOD FOR NONORTHOGONAL 
PROJECTORS 



The use of approximations such as the local spin- 
density approximation (LSDA)^^, for the exchange- 
correlation (XC) functional in Kohn-Sham DFT, is ap- 
propriate and highly successful in systems where the 
magnitude of each electron's kinetic energy t is large 
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compared with the Coulomb interaction U acting on it. 
In such systems, usually comprising elements whose 3d 
or 4/ atomic-like states are either completely empty or 
filled, the LSDA typically provides a good qualitative de- 
scription of both the ground-state density and the insu- 
lating gap. In strongly correlated systems such as Mott- 
Hubbard insulators^^, however, these states are localized, 
partially occupied, and do not fall in the regime oiU -^t. 
In such cases, the LSDA may thus perform very poorly 
unless it is corrected. The DFT+f/ method^^ reintro- 
duces the explicit Coulomb interaction terms, and thus 
the appropriate derivative discontinuity with respect to 
electronic occupation number, to the approximate XC 
functional. 

In the DFT + Hubbard U method (DFT+J/)", a 
number of spatially localized subspaces, sites labeled /, 
wherein the U <^ t regime is not expected to hold, are 
selected for supplementation with explicit Coulomb cor- 
relations beyond the LSDA level, retaining the bare, in- 
expensive XC functional for the remainder of the system. 
The strongly interacting subspaces are spanned by sets of 
localized orbitals, termed the Hubbard projectors {ipm^}. 
The selection of Hubbard projectors is a topic of interest 
in itself and possible choices include localized Wannier 
functions built from the Kohn-Sham eigenfunctions ac- 
cording to maximal localization^^' energy downfold- 
ing^'' or maximal Coulomb repulsion^^ criteria, or in- 
deed a total-energy minimization criterion in combina- 
tion with a self-consistency scheme, as we have proposed 
in a related article, Ref. 22. In the description of our 
linear-scaling method, here we assume only that the pro- 
jectors are confined to a spatial region, real-valued, and 
expressed in the same underlying, systematic basis as the 
orbitals representing the Kohn-Sham density-matrix (in 
the present case a truncated set of plane waves). 

In the tensorial representation, developed in order to 
maintain the tensorial invariance of subspace occupan- 
cies, moments, ionic forces and the total-energy^^, local- 
ized Hubbard projector duals are defined by^^ 

o^l = {^li^\^lih; \^^'^n = \^l'J)o^''-'-, (1) 

where, by definition, oi^i„0(^)™"™' = 5 ™', such that 

I ^ 'flL/th III' 

an individual metric tensor O*-^^ is generated and used 
for each subspace. The occupancy matrix is then most 
conveniently expressed as a mixed tensor (specifically a 
tensor with one contravariant index and one covariant 
index), following Refs. 23 and 25, so that its trace is a 
tensorial invariant, as per 

(2) 

Here, p'^'^^ is the single-particle density-matrix for elec- 
trons of spin cr, formally defined by 

(3) 



where is the occupancy of the Kohn-Sham orbital 

Using this definition, we can cast the rotationally- 
invariant DFT+f/ functional of Refs. 26 and 27 into a 
more general, tensorially invariant form - that is invari- 
ant under arbitrary linear combinations of the Hubbard 
projectors for a given site, following Ref. 23. Specifically, 
we use a tensorially invariant generalization of the widely- 
used, simplified DFT+J7 functional of Ref. 28, where the 
energy functional is given by Edft+u ~ Edft + Ejj, 
with 

la 

and [/^^^ is the screened subspace-averaged Coulomb re- 
pulsion. The DFT+t/ penalty functional approximately 
emulates the exact exchange-correlation functional by in- 
troducing a derivative discontinuity in the total-energy 
with respect to the occupancy matrix, in effect approxi- 
mately enforcing the unrestricted Hartree-Fock approxi- 
mation within the subspaces /. 

III. LINEAR-SCALING DFT+C/ 

We now describe the steps necessary to perform 
DFT+f/ calculations with linear-scaling expense. We 
have previously demonstrated the features of projector 
self-consistency^^ and tensorial invariance^'^, in our im- 
plementation of the method in the ONETEP code**'^, 
however the method described in this article is applicable 
to linear-scaling DFT methods^ generally. The method 
is rigorously general to the case of nonorthogonality of 
both the local orbitals and the Hubbard projectors. 

A. Framework and notation 

Linear-scaling DFT revolves around the optimization 
not of the eigenstates of the Kohn-Sham Hamiltonian, 
but of the density- matrix of Eqn. 3 expressed in terms of a 
set of nonorthogonal local orbitals, {|0a)} (known in the 
ONETEP code as Nonorthogonal Generalized Wannier 
Functions, NGWFs^^), that is 

p(-) -|0„)A-(-)"^(0^|, (5) 

where the tensor K is known as the density kernel and 
is generally non-diagonal. The exponential spatial local- 
ization of the density matrix for insulating materials^^, 

p'-''Hr,r') = (rlp('^)lr') - cxp (-7|r - r'|) , (6) 

must be exploited to achieve linear-scaling, by strictly 
limiting the spatial extent of the local orbitals and trun- 
cating the density kernel to an appropriate length-scale. 

The contravariant duals of the local orbitals are de- 
noted by and the contravariant metric on the 
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orbitals is the inverse of the covariant metric S, so that 

Sap^iKici^p), s"^ = {r\4>^') = {s-y^ . (7) 

We emphasize that the metric S on the local orbitals, 
generally a large matrix containing information about 
the entire system, and the individual metric O^^^ on 
each DFT+f/ subspace, a small matrix (usually 5x5 for 
Sd-type subspaces) and a localized quantity, are distinct 
even in cases where the Hubbard projectors are selected 
as a proper subset of the local orbitals. 

The occupation matrix for each subspace is expressed 
in terms of local orbital matrix elements by inserting the 
expansion of the density- matrix, Eq. 5, into the natu- 
ral occupancy representation of Eq. 2. Making use of 
the transformation rules for Hubbard projectors given 
by Eq. 1, we find that 

^(/)(.)™ ^ ^ 0('^--'\^l2,\<P^)K(^^-^{M^li}). (8) 

We denote the overlap between Hubbard projectors 
and local orbitals {\4>a)} by 

v^2 = iM4i^)-^ w^ii = ^i^^ = (^^Pl</>a), (9) 

which may be very sparse matrices, particularly for a 
low density of subspaces. The DFT+J7 correction to 
the total-energy, given by Eqn. 4, is then computed with 
linear-scaling cost using the sparse matrix trace 

Eu=Yl ^Tr [OWKV{l - OWKV)] ^'^^"^ . (10) 

/,(T 

We have assumed throughout, for notational clarity, that 
identical Hubbard projectors are used for each spin chan- 
nel, although the generalization to cr-dependent Hubbard 
projectors, and thus a-dependent O, V, and W matrices, 
is straightforward. 

B. Efficient use of matrix sparsity 

The DFT+J7 functional of Eqn. 10, does not depend on 
the inter-site occupancy matrices generated using Hub- 
bard projectors for different subspaces, although the gen- 
eralization to inter-site occupancies, DFT+U+V, has 
been introduced in Ref. 30. In DFT+C/, these non-local 
occupancies do not contribute to Eu and thus should 
not be computed unnecessarily. On the other hand, it 
is undesirable from the point of view of both ease of im- 
plementation and computational efhciency to explicitly 
store separate V^-^\ M^*-^^ and O^^^ matrices for each site, 
thereby necessitating individual matrix products for each 
site before explicit summation in, for example, Eqn. 10. 

Our solution is to embed these small transformation 
matrices into large, though very sparse, V, W and O ma- 
trices for the entire system, where they then fit seamlessly 
into the hierarchical, parallelized, sparse algebra routines 



found in a contemporary linear-scaling DFT code^'^. The 
overlap O matrix is block-diagonal in either its covariant 
or contravariant form, the dimension of each block being 
the number of projectors spanning the subspace on the 
site in question, typically 5(7) for a subspace of 3d(4/) 
orbital symmetry. The Hubbard interaction parameters 
are also placed into a sparse matrix U for the entire sys- 
tem, of the same sparsity of O, although, in practice, 
diagonal in the simplest case of a scalar parameter on 
each site. The V matrix has the row sparsity of the or- 
bital overlap matrix S, depending on the orbital cutoff 
radii, and the column sparsity of O; W is its transpose. 

Let us take as an example the computation of the oc- 
cupancy matrix given by Eqn. 8. We henceforth sup- 
press the spin index for notational simplicity, on the 
understanding that the density kernel, its derivatives 
and derivatives with respect to it are generally spin- 
dependent. Working from left to right, temporarily plac- 
ing a site index before each projector index to clarify to 
which subspace it belongs, we first consider the product 

{owf^% = ^ o(^)"'('')"'V(^)^„^ (11) 

J 

which is a matrix with the same sparsity pattern as W 
due to the block-sparsity of the O matrix. Next, taking 
the product with the density kernel on the right, as per 

{OWKf''''" ^ iOwf^% X^", (12) 

we see that this matrix has the sparsity of WK, dense 
in the row index when no density-kernel truncation is 
applied. When kernel truncation is enforced, however, 
the number of values which a can take is reduced and 
the effort needed for the sum over /? is diminished. 
On the final step, where we compute 

n%),^-iOWKf^"''^V^^j^^, (13) 

we accumulate extraneous information on the inter- 
subspace non-locality of the density-matrix. Were we to 
compute this matrix in full and then consider its square, 
we would find that 

/ , " {K)m"' {I)m' r (I)m"' (I)m' ^ y^^) 

K 

the former being generated in the full matrix product, 
while only the latter is required in Eqn. 4. This problem 
is resolved by always truncating the occupancy matrix 

n%^^^{OWKf^"'-V^^,)^ (15) 

to the block-diagonal sparsity pattern as O in advance of 
computing such products, eliminating the off-site occu- 
pancies. In practice, the unnecessary elements are never 
actually computed, and no wasted effort is incurred, since 
the sparse algebra system computes only elements in the 
sparsity pattern of the product matrix^. 
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Matrix sparsity thus plays an important role in the 
construction of our linear-scaling DFT+[/ method, as it 
permits DFT+C/ calculations involving a large number 
of subspaces to be carried out efficiently. We hereafter 
suppress the site index, both to clarify the notation and 
to reflect the fact that the matrix operations are imple- 
mented in practice in terms of calls to sparse algebra 
subroutines which take the matrices V, W, O and U as 
arguments, and not their site-indexed counterparts. 



IV. OPTIMIZATION OF THE 
DENSITY-MATRIX 

In order to minimize the total-energy with respect to 
the density- matrix with linear-scaling cost, while afford- 
ing it the variational freedom of a systematically improv- 
able basis, it is performed in two nested conjugate gra- 
dients minimization loops. We first describe the inner 
loop, in Section IV A, a methodology common to many 
contemporary linear-scaling codes, where the energy is 
minimized with respect to the density kernel for a fixed 
set of local orbitals. 

In the outer loop, the local orbitals {|(/)q)} which span 
the Hilbert space available to the density-matrix are op- 
timized in order to minimize the total-energy. This tech- 
nique is used to obviate the choice of a fixed local orbital 
basis, and numerous variations have been previously de- 
scribed^'^^''^^"^^. The orbitals are assumed to be trun- 
cated to some region, in order to allow for linear-scaling 
cost, and refined with respect to the underlying basis, 
for a fixed density kernel K"^ , in a manner which is 
furthermore compatible with a linear-scaling method for 
optimizing local orbitals for unoccupied states recently 
proposed in Ref. 38. We return to discuss the orbital 
optimization technique in Section IV B. 

Recent success with the DFT+C/ method and its gen- 
eralization to inter-site interactions, DFT+U+V^^ , en- 
courages us to think of DFT+C/ as a true method for 
first-principles energetics"^^"^^ . We have therefore imple- 
mented the DFT+C/ forces terms, as well as the total- 
energy minimization scheme, in the ONETEP code of 
which the capability of accurately optimizing geometries 
has been previously demonstrated^'^. We describe the 
required methodology in Appendix C. 



optimization method required for idempotency preserva- 
tion in Appendix A. 

The DFT+C/ contribution to the Hamiltonian is thus 
simply given by the derivative of the DFT+C/ energy 
term of Eqn. 4 with respect to an arbitrary density kernel, 
that is 



1^" 2 1 dK^P 



dK- 



(16) 



In order to simplify this derivative, we begin by noting 
that the partial derivative of the occupation matrix with 
respect to the density kernel is given by 



dr 



d 



m 7 



Qm'rn"^r J7jfT/ 

m a pjn 



(17) 



The trace of this derivative over Hubbard projectors gives 
the covariant, local-orbital representation of the sum of 
projections over subspaces, which is a Hermitian tensor 
by construction, given by 



W 



P, 



13a- 



(18) 



It follows that the products of the occupancy matrix and 
its derivative, each always computed in terms of the Hub- 
bard projector indices, in practice, since there they have 
the block-diagonal sparsity pattern of O, are given by 



9n" 



' dn" 



{OWr^{PKV)p,^ and (19) 



{owKpy^^v,^,. 



(20) 



As a result, the DFT+C/ term in the covariant Hamilto- 
nian, denoted by , may be succinctly expressed as 



U 



{P-2PKP).. 



(21) 



The DFT+C/ contribution to the total-energy is effi- 
ciently computed, correspondingly, using the trace 



E, 



U 



{PK - PKPK)^ . 



(22) 



A. Kernel Optimization 

Minimization of the energy with respect to the density 
kernel is typically carried out. in practice, using a gener- 
alization of the Li-Nunes-Vanderbilt (LNV) technique'*"', 
which simultaneously drives the density-matrix to idem- 
potency while it evolves towards commutativity with its 
corresponding Kolm-Sham Hamiltonian. In this section 
however, for clarity, we assume that the energy may be 
straightforwardly minimized with respect to the density 
kernel. We return to the adaptations to the density kernel 



The DFT+C/ Hamiltonian and total-energy terms are 
added to their uncorrected DFT counterparts, giving 
Hap = H^if^ + H^p and E = Ed ft + Eu, respectively, 
and similarly for the independent-particle, or "band- 
structure" energy E^^ = Ejfpj. + Ejf , where EjjpT 
EW.^ = HPrxP'^ and Eu ^ Ef^ = HV.RP'^. 



^DFT — ^^ap ^^^"^ 7- 

For a refinement of the auxiliary density kernel if"'', 
any update to it must also be a contravariantly trans- 
forming tensor, as noted in Refs. 25 and 45. In order to 
provide such a search direction, it is necessary that we 
pre- and post-multiply the covariant gradient of Eqn. 21 
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with the contravariant metric tensor on the orbitals, that 
is their inverse overlap matrix evaluated at the point at 
which the gradient itself is computed, to give 



c'i^ = (s- 



(23) 



The inner product of two second-order tensors, X and 
y, is defined with respect to the metric S on the local 
orbitals. so that 



{X\Y)s = X'^^Yp^ = X'^^Sp^Y'T'Ss 



(24) 



This allows us to define the search direction norm 

1 /2 

||G||s = {G\G)y, and the conjugacy condition 
{Gi^i\Gi)s = 0, and thus to minimize the total-energy 
by iteratively updating the density kernel according to 



i+l 



(25) 



where the optimal step lengths {Xi} are computed using 
an appropriate non-linear conjugate gradients algorithm. 



B. Orbital optimization 

We now consider the optimization of the local or- 
bitals, specifically the DFT+[/ contribution to total- 
energy variations with respect to the expansion coeffi- 
cients of the orbitals in the underlying variational ba- 
sis. We again assume that the energy may be directly 
minimized in this section, for simplicity, returning to the 
alterations necessary for idempotency preservation under 
local-orbital optimization in Appendix B. This procedure 
occurs in the outer of the two energy minimization loops 
in the ONETEP code used here, however the results 
of this section apply to any technique which optimizes 
its representation functions for minimal energy, such as 
those described in Refs. 4, 15, 31-37. 

It is clear from Eqn. 22 that the derivative of the total- 
energy with respect to the expansion of the local orbitals 
on the grid (or, in general, the basis) may explicitly de- 
pend only on the matrix elements of the projection P, 
defined in Eqn. 18, so that 



dEr 



, ^ dEu dPf,^ 
(r) dP^^ 90„ (r) 



(26) 



Since this derivative involves the expansion of the Hub- 
bard projections on the grid, it incurs changes beyond 
simple linear mixing of the orbitals. Evaluating this, we 
first take the action of the DFT+[/ Hamiltonian contri- 
bution on the subspace projections, that is 



dEu _ U d 
dP, 



Pi 



2 ap, 



[{PK - pkpkV] 



(27) 



Pi 



P 



The Hubbard projection operators depend explicitly on 
the covariant orbitals which overlap with their corre- 
sponding Hubbard projectors (and Hubbard projector 



duals) and this dependence may be expressed as 



dP, 



Pi 



d(j)a (r) d(l)a (r) 

= '5?^™(r)0 



w„ 



(28) 



Combining this result with Eqs. 26 and 27, we may com- 
pute the DFT+[/ term in the local-orbital gradient, 



dEj. 



be (r) 



(r) (29) 



= 2K^'V,ra"H^ " Vm' (r) , 
where formally, though never explicitly in practice, 



(30) 



Due to the subspace- localized nature of the DFT+J7 cor- 
rection in the tensorial representation^^, only those lo- 
cal orbitals \(f)s) in Eqn. 29 which explicitly overlap with 
the Hubbard projectors | contribute and thus require 
summation over. 

Since, crucially, we require a covariantly transform- 
ing orbital update in order to improve upon those func- 
tions, to preserve their tensorial character, the above con- 
travariant gradient must be multiplied with the covariant 
metric tensor in order to provide the necessary covariant 
DFT+C/ orbital search direction term, given by 



(r) = 2SapKP'Vs,n"H^'^"^"'^„^, (r) . 



(31) 



This contribution may then be combined with the un- 
corrected DFT search direction, giving the total \ga) = 
Ida^'^) + Ida)- The inner product and norm of first-order 
tensors, or orbitals, \x) and \y), are defined such that 

{x\y)s^{x"\ya)^{xa\S"^\yp); (32) 



Ms = 



1/2 



(33) 



and, using these definitions, the total-energy may be min- 
imized by iteratively updating the orbitals, where the 
{/Lt*} are computed using one of many available non-linear 
conjugate gradients algorithms, according to 



(34) 



At each such orbital update step, in practice, we carry 
out a complete re-optimization of the density kernel, ac- 
cording to the procedure of Sub-section IV A. 



V. APPLICATION TO NICKEL OXIDE 
NANO-CLUSTERS 

We performed scaling tests on NiO nano-clusters of 
varying size, comparing the computational effort required 
for DFT+C/ and uncorrected DFT calculations. An an- 
tiferromagnetic insulator, NiO is a well known example 
where LSDA-type approximations^^ fail to qualitatively 
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FIG. 1. (Color online) Scaling of the energy mininiization 
algorithm for NiO nanoclusters of increasing size. Timings 
are for three density kernel optimization steps and one orbital 
optimization step, comparing DFT and DFT+f/ calculations. 
Simulations were performed on 300 Intel Westmere 2.67 Ghz 
cores connected using quad data rate Infiniband. 
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FIG. 2. (Color online) Computational time spent in subrou- 
tines associated with the DFT+(7 functionality in the tests 
shown in Fig. 1. Specifically timings shown are for comput- 
ing the DFT+C/ energy of Eqn. 4, the Hamiltonian matrix of 
Eqn. 21 in its Hubbard projector and local orbital represen- 
tations, and the ionic forces given by Eqn. C6. 



reproduce the correct insulating gap and local magnetic 
moments (of between 1.64 /i^ and 1.9 hb^^) due to a 
poor description of 3d orbital localization. The gap, of 
approximately 4 eV, is of predominantly Mott-Hubbard 
type since it persists above the Neel temperature^^, al- 
beit with a significant charge-transfer component^^. It 
is thus successfully recovered by a number of methods, 
which either include many-body Coulomb correlation ef- 
fects explicitly, such as LDA+DMFT^^, or introduce an 
appropriate derivative discontinuity with respect to oc- 
cupancy at the single-particle level, examples including 
unrestricted Hartree-Fock^^ and the self-interaction cor- 
rected local density approximation^^. The correct de- 
scription of the physics of NiO was an early success for 
DFT+C/, the method of interest here, and this has been 
repeated using numerous functional forms^-'^'^^''^''''^''"^^. 

The method described in this article has previously 
been successfully applied to bulk NiO^^. For a demon- 
stration of computational scaling, we have chosen spheri- 
cal nano-clusters of NiO with even numbers of nickel ions, 
so that an open-shell singlet multiplicity, analogous to the 
bulk antiferromagnetic ground state, could be tentatively 
assumed. We may expect that a transition to a ferrimag- 
netic or ferromagnetic state occurs below some critical 
cluster size, as it has been predicted for very small iron 
oxide clusters of interest for data-storage technology^^'^''. 

Run-time parameters included a 500 eV equivalent 
plane-wave cutoff energy, a spin polarized density ker- 
nel cutoff at 25 ao, the LSDA exchange-correlation func- 
tional^^, nine local orbitals (NGWFs) for each nickel ion 
and four each for oxygen, all with 7.5 ag cutoff radii, 
and norm-conserving pseudopotentials^^. Atomic Hub- 
bard projectors of hydrogenic form were used. Since cal- 
culations on nano-clusters of varying sizes are expected to 
exhibit differing convergence behavior, the energy min- 
imization algorithm was simply run for a fixed number 



of iterations. One orbital optimization step and three 
density kernel steps, with three penalty-functional idem- 
potency corrections iterations at each of the latter, were 
allowed. Orbital overlap matrix inversion was carried out 
using a sparse matrix implementation of Hotelling's al- 
gorithm^^ and a cubic supercell of length three times the 
diameter of each nano-cluster was used, up to a maxi- 
mum supercell length of approximately 300 ao. 



A. Scaling of computational effort for DFT+C/ 

Algorithmic timing data for ONETEP energy mini- 
mization of NiO nano-clusters, containing up to 7, 153 
atoms across 300 Intel Westmere 2.67 Ghz cores, is shown 
in Fig. 1. A reasonable linear fit was obtained for the 
timing; with a slightly negative fitted time intercept at 
450 — 500 atoms indicating a very efficient initialization 
of the pre-requisite data in these calculations. The NiO 
nano-clusters in question do not represent a favorable 
case for the DFT+C/ method, since approximately half 
of the ions host correlated subspaces. Nonetheless, we ob- 
served a very small increase in computational time when 
the DFT+C/ functionality was invoked, at approximately 
5 — 6%, and preservation of linear-scaling performance. 

Timings for generating the DFT+C/ Hamiltonian and 
its contribution to the total-energy and forces, for those 
calculations which fell within memory resources, are de- 
picted in Fig. 2. This indicates that no direct DFT+C/ 
functionality appreciably deviates from linear-scaling be- 
havior. We note, in particular, that the total time spent 
in these DFT+C/ specific subroutines makes up only a 
small fraction of the increase in cost incurred by DFT+C/, 
at less than 1% of the total computational time. 

In order to understand where the dominant contribu- 
tion to the DFT+C/ cost originates, since it is not di- 
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FIG. 3. (Color online) Nano-cluster size dependence of fill- 
ing factors of principal matrices affecting the computational 
cost of linear-scaling DFT and DFT+C/. These, namely, are 
the overlap between orbitals, the density kernel, the Hamilto- 
nian matrices conventional for DFT and for DFT+U, and the 
overlap matrices between the local orbitals and the non-local 
pseudopotential projectors and Ifubbard projectors. 



rectly in the DFT+U subroutines themselves, we direct 
the reader to Fig. 3, where the system dependent sparse 
matrix filhng is quantified. In a conventional DFT cal- 
culation, the sparsity of the Hamiltonian matrix is domi- 
nated by the orbital representation of the non-local pseu- 
dopotential, proportional to the product of the overlap 
matrix between the orbitals and the non-local pseudopo- 
tential projectors with its transpose. In essence, pairs of 
orbitals which overlap with a common non-local projec- 
tor contribute to the energy, and the same holds for the 
Hubbard projectors of DFT+J7. While non-local pseu- 
dopotential projectors tend to have cutoff radii not in 
excess of approximately 2 ag, Hubbard projectors of 3c? 
symmetry may require greater cutoff radii (for hydro- 
genic orbitals of effective nuclear charge in the typical 
range for transition-metals, Z = {8,9,10,11}, the nor- 
malization spillage at 2 ao is {9.3, 4.6, 2.1, 0.9}%). In our 
calculations the projector cutoff radius is set equal to 
that of the local orbitals, 7.5 ag. 

This increased Hamiltonian filling has consequences 
additionally for the calculation of energy gradients, as 
indicated in Fig. 4, which shows the fractional change 
in time spent in carrying out certain energy minimiza- 
tion operations. Most notably, it takes close to twice 
as much effort to calculate its expansion on the psinc 
grid due to the inclusion of Hubbard projectors in the 
Hamiltonian. The dominant part of the overall expense 
of the calculations is from operations on large matrices, 
however, so that grid expansion of the Hamiltonian is not 
significant for large systems. The incurred increase in the 
fining of the Hamiltonian matrix in DFT+U over DFT, 
and also in the expense of computing its products with 
quantities such as the density kernel and its expansion 
on the underlying, systematic plane- wave basis, is thus 
largely responsible for the observed, albeit moderate, in- 



FIG. 4. (Color online) Fractional change in time expended on 
energy-minimization operations when the DFT+U function- 
ality is activated. Namely, these are calculation of the local 
orbital (NGWF) gradient, expansion of the Hamiltonian in 
the basis, kernel optimization using the LNV method, and 
calculation of the matrix elements of the local potential. 



crease in computational expense indirectly introduced by 
DFT+C/, from 1% to 5 - 6%. 



VI. CONCLUDING REMARKS 

We have detailed a linear-scaling implementation of 
the DFT+C/ method for treating strongly-correlated sys- 
tems from first-principles. The formalism is generally 
appropriate to methods which minimize the energy with 
respect to the single-particle density-matrix, and allows 
for the optimization of both nonorthogonal Hubbard pro- 
jectors^^ and ionic positions. 

The preservation of linear-scaling performance on 
metal-oxide nano-clusters in excess of 7, 000 atoms is 
demonstrated. For systems of this type, with a high den- 
sity of correlated sites, the increase in computational pre- 
factor remains rather modest. The DFT+C/ functional- 
ity, furthermore, incurs negligible cost in large systems 
comprising only a small number of Hubbard subspaces. 

Ground state calculations employing our method have 
previously been demonstrated on both bulk and molec- 
ular strongly interacting systems^-^'^^, with further ex- 
amples on large-scale systems such as dilute magnetic 
semiconductor (Ga,Mn)As^^ and disordered V02^^, us- 
ing an extension of the method to DFT+DMFT, forth- 
coming. Further examples of candidate systems include 
organometallic molecules, such as metalloproteins and 
molecular magnets, where the method is particularly effi- 
cient for a low density of strongly interacting subspaces, 
and solids such as magnetic heterostructures, defective 
and doped oxides or catalytic interfaces with oxide sur- 
faces. We envisage that the technique described may aid 
in bringing linear-scaling DFT to bear on more challeng- 
ing systems than those to which it is has been typically 
applied to date. 
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Appendix A: Preservation of density-matrix purity 
under kernel optimization 

In the LNV method^^, the Kohn-Sham density kernel 
is related to the auxiliary density kernel via one iteration 
of the McWeeny purification transform, that is 



K 



{3LSL - 2LSLSL) 



(Al) 



In our treatment of DFT+C/, we go a step further and 
provide the more general expressions needed for the 
HSMP^° variant of the LNV method, in which the den- 
sity kernel K is expressed as a purified and normalized 
auxiliary density kernel, explicitly 



K 



N 



-K- 



a/3 



N {3LSL - 2LSLSLf^ 
S^s {iLSL - 2LSLSL) 



(A2) 



(57 ' 



where N is the correct occupancy for the spin channel in 
question. The kernel renormalization introduces terms in 
the gradient proportional to an effective chemical poten- 
tial, projecting out any first-order changes to the electron 
number, driving the density kernel K towards both nor- 
malization and idempotency as the energy is minimized. 

To locate the derivative of the DFT+J7 energy term 
with respect to the auxiliary density kernel, stressing that 
it is computed strictly using the purified and renormal- 
ized density kernel, we make use of the chain-rule for 
matrix derivatives to write 



dEu _ dEu dK-^^ 



(A3) 



where we carefully note that the partial derivative with 
respect to a doubly contravariant tensor is a doubly co- 
variant tensor with indices once permuted to allow for the 
complex case. It may be readily shown, using Eqn. Al, 
that the latter term is given by 



3 {5lSp,U' + L^'S.Jl) 



(A4) 



The derivative of the DFT+U" energy with respect to 
the purified density kernel K may be broken into prod- 
ucts of derivatives, and rearranged as follows 



d 



dEr, 



Eu 
N 



K 



(A5) 



S. 



(57 



We may next write the gradient with respect to the den- 
sity kernel in terms of a preconditioned contribution to 
the Hamiltonian, denoted by 



= Hs~, - H Sg. 



67 



(A6) 



where jj^ is identified as the DFT+[/ correction to the 
chemical potential, since 



dEu 



N 



N 



H. 



57 



(A7) 



^7 ■ 



It is worth noting that, just as the DFT+J7 independent- 
particle energy correction. Elf ~ H^^pK^", does not 
equal the energy term Eu, so the energy correction en- 
tering into the computation of ijF is not identical to the 
DFT+t/ correction to the total-energy. 

The required DFT+C/ energy gradient is provided by 
the product of the preconditioned term in the Hamilto- 
nian and the derivative of the density kernel with respect 
to its auxiliary counterpart, given by 



dEu 



N 



(A8) 



Finally, combining Eqs. A4 and A8, and recalling 
Eqn. 23; we arrive at the DFT+C/ contribution to the 
contravariant density kernel gradient. 



G" 



Q/3 



dEu 
N 



(A9) 



S^s {iLSL - 2LSLSLp 

3 {S-^HL + LHS- 
X < -2LHL 

-2 [S-^HLSL + LSLHS-^ 



Appendix B: Preservation of density-matrix purity 
under orbital optimization 

As in the case of the density kernel gradient, the orbital 
gradient is calculated using the purified and renormalized 
density kernel, and so it contains a preconditioning term 
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which drives the trace of the density-matrix to the correct 
occupancy of the system. The energy derivative with 
respect to covariant orbitals may be decomposed as 

dEu _ dEu (dK^^dK^' dK^A dS^,-, 



bo. (r) dkP^ \ dRS- 95c^ ^^C^ J (r) 
, dEu dPp^ 



dP^^ 90„ (r) ■ 



(Bl) 



The terms contained in parentheses in Eqn. Bl may be 
evaluated, respectively yielding 



5^1 



(B2) 



and 



The covariant metric explicitly depends only on the co- 
variant orbitals, so that 



ha (r) 



6^<j>^ (r) + (r) . 



(B3) 



Contraction of the DFT+C/ term in the Hamiltonian 
and the terms in Eqn. B2 provides a tensor Q which rep- 
resents a contribution to the local orbital gradient purely 
due to mixing among the orbitals, given by 



N 



(B4) 



3LHL - 2LHLSL - 2LSLHL 

To conclude, by combining Eqs. 29, Bl, and B4, the 
contravariant gradient of the DFT+J7 energy with re- 
spect to the orbitals is given by 



dE, 



= 2 {k'^^V^„,,H"''"'^^, + g"^0c) (r) , (B5) 

which is then transformed to the required covariant form, 
in the same manner as per Eq. 31, giving 

gZ (r) = 25„^i^^«Fc™"^""'"'¥'™' (r) (B6) 
+ 2SapQ^^4>Q (r) . 

Appendix C: Ionic forces 

We may assume that the ground-state density is lo- 
cated for a given ionic configuration before the forces are 
computed, so that the total-energy is variationally mini- 
mized with respect to both the orbital expansion coeffi- 
cients and the matrix elements of the density kernel. The 
DFT+[/ correction then contributes to the ionic forces 
only via the spatial dependence of the Hubbard projec- 
tion operators, that is for the ion labeled j, 

dEu dEu dP^H 



dn. 



(CI) 



The lattermost derivative may be expressed in terms of 
gradients of the covariant projectors and contravariant 
subspace metric tensors, specifically 



dP. 



dP^p dipUHr) 



dRi 



^ dipl^J (r) 



(C2) 



dPap dO- 



dO"^' 



dR, 



where, in the tensorial representation^'^ 



dRi 



dR, 



(C3) 



vanishes in the conventional case that the Hubbard pro- 
jectors are rigidly translated with their host ions. 

We note, next, that the Hubbard projectors are usu- 
ally considered to be associated with one atomic site only, 
and so the subspace index / need only run over subspaces 
centered on ion j. We may thus suppress the summation 
symbol in Eqn. CI, for notational clarity, since the gener- 
alization to multiple subspaces per ion is straightforward. 
Denoting the spatial derivative of the Hubbard projectors 
by the three-component vector 



■"ctm / ^ \ 

E 

ID] ■ 



(C4) 



J dr cj)a (r) J dG i-iG) e"'^' (G) 



the remaining terms in Eqn. C2 may be expressed as 



dP, 



Q/3 



d 



dR, 



d^ii^ (r) 
^ dR^ 



j dG e-'° V(p (G) 



(C5) 



dr 



W„.„p + V^^.O^ ^<\>f, ) (r) 



5^[|dG (-^G)e-'«>L'' (G) 



dr 



Combining the latter result with Eqs. 27 and CI, we 
conclude that the tensorially consistent DFT+[/ contri- 
bution to the ionic forces is succinctly given by the sparse 
matrix trace, noting the resemblance to Eqn. 29, 



9 7r-/3QT/ Tjm' ra-sj'{,3)\ 



(C6) 



This may be used, for example, to perform DFT+[/ cor- 
rected ionic geometry optimization, molecular dynamics, 
or calculations of vibrational spectra on large systems. 
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